DeepSeek V4 #24162
@am17an I wonder what's the purpose of f32 casts and conts after mulmats here? Removed them and got the same logits. |
|
@fairydreaming it's an artifact of debugging, you can push your changes to this branch (I added you as collaborator) |
|
Played with flash attention this weekend, here's my experimental patch: With FA enabled and a lightning indexer GGML op added, compute buffer memory usage got really low. I think processing 1M tokens is achievable on a single RTX PRO 6000 Max-Q with CPU expert offloading (f16 cache type), even with an 8k ubatch size. Some performance numbers (Epyc 9374F + RTX PRO 6000 Max-Q): Max memory usage I saw in nvidia-smi was 60836MiB / 97887MiB. Edit: forgot about Pro benchmark results, aborted in the middle but it got to:
|
@am17an Any specific reason you went with DEEPSEEK_V4_FLASH/deepseek-v4-flash/deepseek_v4_flash when naming things instead of simply DEEPSEEK4/deepseek4/deepseek4? I mean this convention is a bit inconsistent with existing names and the flash part is confusing (sounds like flash only while pro uses this architecture too), maybe it would be better to change it now before it spreads? (I noticed that even the architecture name in GGUF is deepseek-v4-flash, so we'd have to update it in existing GGUF files or reconvert). |
|
I'm going to work on making graph reuse work across various compression boundaries and also make multi-sequence work, along with fixing a couple of issues. After that I think a round of simple optimization + running some evals and then this should be ready for review. Since it's a large PR it may make sense to separate out conversion, chat and then the model into separate PRs. In parallel #24231 + FA can be included when they're ready |
|
@am17an Sounds good, I stared at tensor values for the last few days comparing them with the DeepSeek inference code but haven't found any obvious problems. |
|
For anyone interested I have this PR with various optimizations (#24231+CUDA, #24011, FA changes) in my repo: https://github.com/fairydreaming/llama.cpp/tree/dsv4 PP is the same as reported above, TG is ~70% faster. |
Thanks! I tried but failed. It looks like antirez's gguf is not yet supported? |
@rujialiu Unfortunately there are multiple naming differences for model parameters and tensors that prevent antirez GGUFs from working with this PR. |
|
@am17an On the other hand, maybe it's a good idea to unify the naming with antirez GGUFs? From what I see in the files there's only a single difference in tensor shapes, in the attention output tensor: [4096, 1024, 8] vs [4096, 8192, 1]. I can try to fix it in the meantime, what do you think?
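Editorial sketch of why the two attention output shapes could be compatible: both hold the same number of elements, and if both tensors are contiguous with dimension 0 varying fastest (the ggml convention), element (i0, i1, i2) of the [4096, 1024, 8] layout sits at the same flat offset as element (i0, i1 + 1024*i2, 0) of the [4096, 8192, 1] layout. Whether the data in the actual GGUF files is stored in this order is an assumption that has not been checked here; `Shape3`, `n_elements` and `flat_offset` are hypothetical helper names, not llama.cpp or ggml API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 3-D shape, dimension 0 first (fastest varying), as in ggml's ne[].
using Shape3 = std::array<int64_t, 3>;

// Total number of elements in the tensor.
int64_t n_elements(const Shape3 & ne) {
    return ne[0] * ne[1] * ne[2];
}

// Flat element offset of (i0, i1, i2) in a contiguous tensor whose
// dimension 0 varies fastest. Assumes no padding between rows.
int64_t flat_offset(const Shape3 & ne, int64_t i0, int64_t i1, int64_t i2) {
    assert(i0 >= 0 && i0 < ne[0]);
    assert(i1 >= 0 && i1 < ne[1]);
    assert(i2 >= 0 && i2 < ne[2]);
    return i0 + ne[0] * (i1 + ne[1] * i2);
}
```

Under those assumptions, converting between the two layouts would only change the shape metadata, not the tensor data.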
Thanks for the reply. I'm especially interested in trying this REAP version in antirez's format, which (hopefully) is small enough for lower-end machines with only 64GB RAM: |
|
@fairydreaming sure, I think it makes sense to support already existing GGUFs. BTW can you check the latest commit for any perf improvements on your setup? Graph reuse was added across CSA boundaries |
|
@am17an Merged the changes and I see an improvement, TG in Flash now exceeds 20 t/s for short prompts (was around 18): |
|
@rujialiu OK, this is weird. I made a patch that allows antirez DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf that I downloaded some time ago to work in this PR, but your DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf for some reason causes |
|
Not sure if it's too early for this but I'm noticing a consistently reproducible issue where the model outputs malformed JSX tags during long responses as follows: Happens with both the raw unquantized Q8 GGUF and the quantized Q3 GGUF that I normally use but isn't reproducible with responses over the web/API. Doesn't happen with short responses. Repo used:
Command used for HF -> GGUF: Command used for Quantization: Launch command:
Prompt used for this:
My Setup: 2x RTX 3080 20GB |
@fairydreaming Thanks! I tried that REAP version with cchuter's branch i.e. https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda which works with that DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf on my machine (tg ~4 tok/s, pp even slower). I also got: I can't check whether this REAP gguf works with antirez's ds4 because ds4 doesn't support native Windows. I had good experience running Minimax 2.5 REAP with llama.cpp, but I don't have any way to ensure that gguf is sane (or at least works with official ds4). Sorry about that. |
|
@fairydreaming OK, I found that cchuter's branch works with that REAP gguf (actually I tried a slightly larger 180B REAP gguf instead) with |
|
@Lowkey-Loki-SN I think it has something to do with tokenization; it messes up even small JSX templates for me. Mostly extra whitespace.
Glad to hear it's reproducible on your end too! And yes, it is always either extra whitespace or newlines when it happens on my end |
|
@rujialiu From what I see the problem is that expert indices read from tid2eid tensors during Edit: @am17an is right, I disabled expert offloading (so all CUDA now) and now it works with ubatch 8 but fails with ubatch 9.
|
@fairydreaming ubatch 32 is when the offload would kick in, so probably something in cuda backend |
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
What's your command to launch the server? |
make -j && ./bin/llama-server -m ./DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --port 8014 -c 65536 --host 0.0.0.0 -lv 4 |
if (p0 > 0) {
    // DSV4 compressed cache rows are derived from running compressor state,
    // so arbitrary rollback is not reconstructible from the raw cache alone.
    // Allow the common prompt-cache cleanup no-op: remove [end, infinity).
    if (seq_id >= 0 && p0 > kv_raw->seq_pos_max(seq_id)) {
        return true;
    }

    return false;
}
Without partial sequence removal, are we going to be able to support MTP?
We can still use checkpoints and do MTP=1. The current partial state is just ~17 MB, so it should be possible to do something similar to what we do in Qwen for MTP > 1.
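Editorial sketch of the rollback restriction discussed in this thread: for a removal request starting at p0 > 0, the only case the quoted snippet accepts is one that starts past the last stored position, because it removes nothing. The names `pos_t` and `is_noop_tail_removal` are hypothetical and stand in for the real types; the p0 == 0 case is not shown in the snippet and is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the position type; not the real llama.cpp typedef.
using pos_t = int32_t;

// For a removal of [p0, infinity) with p0 > 0, returns true only when the
// range starts past the last stored position, i.e. the removal is a no-op.
// Any removal that would drop stored positions returns false, since the
// compressed rows cannot be rebuilt from the raw cache alone.
bool is_noop_tail_removal(pos_t p0, pos_t seq_pos_max) {
    assert(p0 > 0);  // the sketch only covers the p0 > 0 branch
    return p0 > seq_pos_max;
}
```

With positions 0..9 stored, a request starting at 10 is accepted, while requests starting at 9 or 5 are rejected.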
// When either raw or compressed state is per-sequence, split ubatches so
// every token maps cleanly to its stream. This may serialize independent
// non-unified sequences, but keeps compressed state ownership explicit.
do {
    balloc.split_reset();

    std::vector<llama_ubatch> ubatches;
    while (true) {
        llama_ubatch ubatch;
        if (comp_coupled_same_set) {
            ubatch = balloc.split_equal(n_ubatch, false);
        } else if (comp_coupled) {
            ubatch = balloc.split_seq(1);
        } else if (comp_per_seq) {
            ubatch = balloc.split_seq(n_ubatch);
        } else {
            ubatch = balloc.split_equal(n_ubatch, raw_per_seq);
        }

        if (ubatch.n_tokens == 0) {
            break;
        }
        ubatches.push_back(std::move(ubatch)); // NOLINT
    }

    if (balloc.get_n_used() < balloc.get_n_tokens()) {
        break;
    }

    if (auto ctx = make_context(std::move(ubatches))) {
        return ctx;
    }
} while (false);
I'm looking at the multi-sequence change (e16065f) and it seems that it does not accomplish the goal of properly supporting the non-unified KV cache. For context about how the non-unified KV cache should work see #14363. In short, it requires ubatches with equal sequence lengths (i.e. split_equal).
However, the implemented logic always does split_seq. This is the correct thing to do when the non-unified KV cache is not supported by the graph. The idea is that when we use split_seq, we guarantee that each ubatch will only have tokens from a single sequence, so the graph does not need to handle multiple streams. For example, we do the same thing with the recurrent cache when using rollbacks, because the non-unified cache is currently not supported there either:
llama.cpp/src/llama-memory-recurrent.cpp
Lines 419 to 423 in 2333185
This is a workaround, not the proper solution. If my understanding is correct, a lot of the new logic added in that commit is not necessary because in the end we still end up using split_seq instead of the desired split_equal. Therefore we can simply work around it by using split_seq, similar to the recurrent cache above, and avoid the extra logic.
In the future, we have to rework non-unified KV cache to be properly supported. I'm planning to do it for the recurrent memory first, so that Qwen3.6 runs faster with parallel sequences. For DS4 I was hoping we can start with the correct implementation from the beginning. But if it is too complicated, we can try to do it later.
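Editorial sketch of the difference between the two splitting strategies discussed in this review comment. This is a toy model, not the llama.cpp batch allocator: `ToyUbatch`, `toy_split_seq` and `toy_split_equal` are hypothetical names, and sequences are reduced to a count of remaining tokens. The toy split_seq puts tokens from exactly one sequence into each ubatch; the toy split_equal takes the same number of tokens from every sequence that still has tokens, which is the property the non-unified cache is described as needing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyUbatch {
    std::vector<int> seq_ids;  // sequences present in this ubatch
    int n_seq_tokens;          // tokens taken from each of those sequences
};

// Toy split_seq: every ubatch holds tokens from exactly one sequence.
std::vector<ToyUbatch> toy_split_seq(std::vector<int> remaining, int n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<ToyUbatch> out;
    for (std::size_t s = 0; s < remaining.size(); ++s) {
        while (remaining[s] > 0) {
            const int n = std::min(remaining[s], n_ubatch);
            out.push_back(ToyUbatch{{static_cast<int>(s)}, n});
            remaining[s] -= n;
        }
    }
    return out;
}

// Toy split_equal: every ubatch takes the same number of tokens from each
// sequence that still has tokens, within a total budget of n_ubatch tokens.
std::vector<ToyUbatch> toy_split_equal(std::vector<int> remaining, int n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<ToyUbatch> out;
    while (true) {
        ToyUbatch ub{{}, 0};
        int min_rem = 0;
        for (std::size_t s = 0; s < remaining.size(); ++s) {
            // Skip finished sequences; admit at most n_ubatch sequences.
            if (remaining[s] <= 0 || static_cast<int>(ub.seq_ids.size()) == n_ubatch) {
                continue;
            }
            min_rem = ub.seq_ids.empty() ? remaining[s] : std::min(min_rem, remaining[s]);
            ub.seq_ids.push_back(static_cast<int>(s));
        }
        if (ub.seq_ids.empty()) {
            break;
        }
        // Equal share per sequence, capped by the shortest remaining sequence.
        const int per_seq_cap = n_ubatch / static_cast<int>(ub.seq_ids.size());
        ub.n_seq_tokens = std::min(min_rem, per_seq_cap);
        for (int s : ub.seq_ids) {
            remaining[s] -= ub.n_seq_tokens;
        }
        out.push_back(ub);
    }
    return out;
}
```

For two sequences with 3 and 5 tokens and n_ubatch = 4, the toy split_seq never mixes sequences in a ubatch, while the first ubatch of the toy split_equal carries 2 tokens from each sequence, so a graph consuming it would have to handle both streams at once.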

Overview
Still a WIP, lots of work to do before this is usable. At the current stage it passes long context/tool calling tests but is quite slow. All the complexity is in the new llama-kv-cache-dsv4 + deepseekv4 model class, with no new ggml ops at the moment.
To run the flash version you need at least 100 GB VRAM (you can use antirez's GGUF or use this PR to convert one); for the full flash version, 160+ GB. Here's how I was running the server on a DGX Spark:
llama-server -m dsv4-q2_k.gguf -fa 0 -c 32768 --jinja --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --fit off
Note that it is extremely slow at the moment (~4-5 toks/sec).
Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes
Additional information
Requirements